Proper Names In A Semantic Database
Abstract
Among the resources developed in SI-TAL (Integrated Systems for the Automatic Treatment of Language), ItalWordNet (IWN) was built as a reference semantic database, enlarging the Italian WordNet developed in the framework of the European project EuroWordNet (EWN). The Italian lexical database was extended by introducing and encoding, besides the new grammatical categories of adjectives and adverbs, a subset of proper names. In the IWN context, the subset of proper names is quantitatively limited, about 3,600 synsets, but it may become a qualitatively important extension. The ever-growing amount of unstructured information stored in natural language requires computational instruments able to manage it, and proper names occur with remarkable frequency in all types of text. The work presented here falls within this context and focuses on: i) the encoding of proper names in the IWN database; ii) their more typical uses, whether proper or metaphorical and metonymic, as evidenced by textual corpora; iii) the possibility of a well-reasoned and structured enlargement of this data on the basis of the recent experience gained in IWN.

1. Building the set of proper names in the IWN database

IWN was first developed within the EWN project (Vossen, 1999) and then extended in the framework of SI-TAL, an Italian national project for the automatic treatment of language. IWN (Roventini et al., 2002) is a lexical-semantic database containing semantic information for about 50,000 synsets of nouns, verbs, adjectives, adverbs, and a subset of proper names. The information is encoded in the form of lexical-semantic relations between pairs of synsets (synonym sets).
The most important relations encoded, using machine-readable dictionaries as sources, are synonymy and hyponymy; however, a rich linguistic model was designed containing many other lexical-semantic relations, which are encoded for various subsets of Italian nouns, verbs and adjectives. All the synsets are also linked to WordNet 1.5, the Princeton WordNet database (Miller et al., 1990). In the framework of the SI-TAL project the lexical coverage of IWN has been extended by adding, besides two grammatical categories not encoded in EWN (i.e. adjectives and adverbs), the set of proper names considered in this paper. This decision was also due to the high incidence of proper names observed in the corpus selected within SI-TAL for semantic annotation.

1 EWN was a project in the EC Language Engineering (LE4003) programme. Complete information on EWN can be found at its web site: http://www.hum/uva.nl/~ewn.
2 The SI-TAL project (‘Integrated Systems for the Automatic Treatment of Language’) was a national project, coordinated by A. Zampolli, devoted to the creation of large linguistic resources and software tools for Italian written and spoken language processing. Besides IWN, the following were developed within the project: a treebank with three-level syntactic and semantic annotation, a system for integrating NL processors in applications for managing grammatical resources, a dialogue-annotated corpus for applications of advanced vocal interfaces, and software and tools for advanced vocal interfaces.

1.1 Coding proper names

In IWN proper names are connected to the class they belong to by means of the Belongs_to_class relation. This relation and its reverse, Has_instance, are only used to link instances with synsets. Indeed, in the IWN database, unlike the well-known Princeton semantic WordNet (Miller et al., 1990), the hyp(er)onymy or ‘is-a’ relation was not used for this part of the lexicon.
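The instance-to-class link and its reverse can be illustrated with a minimal sketch. It is not the IWN implementation: the `RelationStore` class, its methods and the `REVERSE` table are hypothetical names introduced here, and only the relation names Belongs_to_class and Has_instance, together with the Roma and Lucca examples used in this paper, come from the text. The sketch assumes that adding a Belongs_to_class link should make the reverse Has_instance link available automatically.

```python
from collections import defaultdict

# Hypothetical reverse-relation table; only the pair named in the text is listed.
REVERSE = {
    "Belongs_to_class": "Has_instance",
    "Has_instance": "Belongs_to_class",
}


class RelationStore:
    """Toy store of labelled links between entries (illustrative only)."""

    def __init__(self):
        # (source, relation) -> set of targets
        self._links = defaultdict(set)

    def add(self, source, relation, target):
        """Record a link and, if the relation has a reverse, the reverse link too."""
        self._links[(source, relation)].add(target)
        reverse = REVERSE.get(relation)
        if reverse is not None:
            self._links[(target, reverse)].add(source)

    def targets(self, source, relation):
        """Return the targets of a relation for an entry, sorted for stable output."""
        return sorted(self._links.get((source, relation), ()))


store = RelationStore()
store.add("Roma", "Belongs_to_class", "città")
store.add("Lucca", "Belongs_to_class", "città")

print(store.targets("città", "Has_instance"))
print(store.targets("Roma", "Belongs_to_class"))
```

Keeping instance links in a relation of their own, separate from hyponymy, means a query over Has_instance returns only individuals and never subclasses, which mirrors the design choice described above.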
For proper names, the “inherence propositions” between an individual and a class are allowed, not the “relation propositions”, which are allowed only between classes (Blanché, 1968). It is what the name denotes that belongs to a class, not the name itself: using a name is not a matter of representing it as having certain properties but, as Russell (1919) said, “merely to indicate what we are speaking about...”. Moreover, whereas Common Nouns may have some relation with the referent, so that they are almost all the same, the Proper Name is a target that does not depend on the context: “a Name refers to an individual. And once the meaning of the name has been established, a context cannot normally change very much of it.” (Pamp, 1985).

However, IWN uses other relations to link proper names with common nouns and adjectives: Derivation and Pertains_to. Derivation is a morphological relation, which links the proper name with its derivatives and vice versa. As in EWN, it is used to encode derivation links when no other semantic relation is available. It connects variants belonging to different PoSs and applies to both first-order and second-order entities, as shown in the examples below:

Grande (wide) Derivation grandemente (widely)
Marx Derivation marxismo (Marxism)
Romanità Derivation Roma (Rome)

The Pertains_to relation and its reverse, Has_pertained, have been used both in WN and in EWN. Pertains_to links a relational adjective with a noun. In IWN this relation applies either between synsets or between synsets and instances: it connects second-order entities with first-order entities, or second-order entities with instances:

dantesco (dantean) Pertains_to Dante
musicale (musical) Pertains_to musica (music)

Proper names were also linked with WordNet 1.5 by means of equivalence relations.
The Eq_synonymy relation is used to link proper names with an equivalent instance in WN; in IWN the Eq_belongs_to_class relation, which was not present in EWN, is used to map proper names to the generic class they belong to when they have no equivalent in WordNet. Summing up, the following examples show all the types of relations so far encoded for this subset:

Roma Belongs_to_class città (city, town)
Romano Pertains_to Roma (Rome)
Roma Derivation romanità (Roman world)
Roma Eq_synonymy Rome
Lucca Eq_belongs_to_class town

1.1.2 Coding geographic names

As said above, in order to choose the first nucleus of proper names in IWN, we referred to the corpus selected for the semantic annotation, where we found that geographic names such as Italia, Roma, Milano, New York etc. were among the most frequent in the list of occurrences. Furthermore, geographic names denote ‘entities’ that have a kind of ‘stability’ and give rise to adjectives and common nouns which should be linked to their bases. For all these reasons we decided to start the coding from these kinds of proper names. The geographic names were subdivided into many types of semantic classes (over 25): nation, city, region, sea, lake, river, etc., and all of them (more than 1,300) were linked to the class they belong to by means of the semantic relation Belongs_to_class, for example:

Firenze Belongs_to_class città (city)
Cuba 1 Belongs_to_class isola (island)
Cuba 2 Belongs_to_class nazione (nation)

When coding, we noticed that homonymy among nouns denoting different objects occurs either for ‘instances’ belonging to different classes (e.g. Cuba, Washington, New York), or for ‘instances’ belonging to the same class (e.g. Hebron, Tripoli, Cambridge). In the first case, we created one entry for each class.
In the second case we used only the definition to distinguish two identical entries, such as, for example, Hebron in Canada and Hebron in Israel; in the future, however, we shall extend to this kind of geographic name the possibility of being encoded by means of the relation Has_holo_location / Has_mero_location, which would make explicit (for automatic applications in NLP) the different places where the homonymous towns are situated. Thus we will have:

Hebron Has_holo_location Canada
Hebron Has_holo_location Israel

Another phenomenon occurring within geographic names is that they have sometimes changed with the passing of time. In these cases the present names have been included in the database, and the older, but better known, ones have been included as variants; see for example the case of {Birmania, Myanmar} or {Persia, Iran}.

1.1.3 Other kinds of names

Moreover, from the TAL corpus and from the DMI, a file of over 250 records has been created, made up of names of famous persons that have given rise to adjectives and/or common nouns (e.g. Ario, Machiavelli, Parkinson, etc.). More than 70 types or classes to which these names belong as instances have been identified, e.g. sculptor (Fidia), painter (Modigliani), character of a novel or a drama (Hamlet), writer (De Amicis), philosopher (Plato), etc. In a few cases a proper name denoting a person has been encoded as an instance of more than one class, e.g. ‘Michelangelo’ belongs ‘conjunctively’ to the classes ‘painter’, ‘sculptor’ and ‘architect’. A definition has been given to all these personages (about 300) using the De Agostini ‘Compact Encyclopedia’ as a reference.

Some files have also been extracted from sources of various types: atlases, web sites, and several lists containing names of famous persons, divinities, celestial bodies, etc., useful in extending the lexical coverage of IWN. They have been checked, tidied and reorganised, and then added to the set as new entries.
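The treatment of homonyms and of multi-class instances described above can be sketched as follows. This is an illustration under stated assumptions, not the IWN data format: the `Instance` record, the `lookup` helper and the field names are hypothetical, and assigning Hebron to the class città is an assumption based on the text calling both places towns. The sketch shows one entry per class for Cuba, two same-class Hebron entries separated by the planned Has_holo_location relation, and a single Michelangelo entry belonging to three classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one proper-name entry."""
    name: str
    classes: frozenset                   # targets of Belongs_to_class
    holo_location: Optional[str] = None  # target of Has_holo_location, if any


ENTRIES = [
    # Homonyms in different classes: one entry per class.
    Instance("Cuba", frozenset({"isola"})),
    Instance("Cuba", frozenset({"nazione"})),
    # Homonyms in the same class: told apart by Has_holo_location.
    # The class "città" for Hebron is an assumption made for this sketch.
    Instance("Hebron", frozenset({"città"}), holo_location="Canada"),
    Instance("Hebron", frozenset({"città"}), holo_location="Israel"),
    # One person belonging 'conjunctively' to several classes.
    Instance("Michelangelo", frozenset({"painter", "sculptor", "architect"})),
]


def lookup(name, cls=None, located_in=None):
    """Return entries matching a name, optionally filtered by class or location."""
    return [
        entry
        for entry in ENTRIES
        if entry.name == name
        and (cls is None or cls in entry.classes)
        and (located_in is None or entry.holo_location == located_in)
    ]


print(len(lookup("Cuba")))                       # both Cuba entries
print(lookup("Hebron", located_in="Israel"))     # a single Hebron entry
print(len(lookup("Michelangelo", cls="sculptor")))
```

With only the definition to distinguish the two Hebron entries, a class filter alone still returns both; the location filter is what makes the distinction available to an automatic application.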
Up to now the set of proper names contains 3,600 instances, belonging to 200 classes. The table below shows a few of the most represented classes:

Belonging class        No. of “instances”
Città (city, town)     556
Publication date: 2002